CLAIMS



What is claimed is: 

1. A method for detecting similar documents comprising the steps of: 

obtaining a document;

filtering the document to obtain a filtered document;

determining a document identifier for the filtered document and a hash value for the
filtered document;

generating a tuple for the filtered document, the tuple comprising the document identifier
for the filtered document and the hash value for the filtered document;

comparing the tuple for the filtered document with a document storage structure
comprising a plurality of tuples, each tuple in the plurality of tuples representing one of a
plurality of documents, each tuple in the plurality of tuples comprising a document identifier and
a hash value; and

determining if the tuple for the filtered document is clustered with another tuple in the
document storage structure, thereby detecting if the document is similar to another document
represented by the another tuple in the document storage structure.
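
The steps of claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the filter (lowercasing and whitespace collapsing), the use of SHA-1 at this step, and the names `filter_document`, `make_tuple`, and `DocumentStore` are all hypothetical choices made for the example. Here two tuples are treated as clustered when their hash values are equal.

```python
import hashlib


def filter_document(text):
    # Assumed filter: lowercase the text and collapse runs of whitespace.
    return " ".join(text.lower().split())


def make_tuple(doc_id, text):
    # Tuple of (document identifier, hash value of the filtered document).
    filtered = filter_document(text)
    digest = hashlib.sha1(filtered.encode("utf-8")).hexdigest()
    return (doc_id, digest)


class DocumentStore:
    """Hypothetical storage structure: hash value -> list of document ids."""

    def __init__(self):
        self._by_hash = {}

    def add_and_check(self, doc_tuple):
        # Insert the tuple and return ids of documents it is clustered with.
        doc_id, digest = doc_tuple
        cluster = self._by_hash.setdefault(digest, [])
        similar = [other for other in cluster if other != doc_id]
        if doc_id not in cluster:
            cluster.append(doc_id)
        return similar
```

In this sketch, adding a second document whose filtered text matches an earlier one returns the earlier document's identifier, and an unrelated document returns an empty list.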



2. A method as in claim 1, wherein the step of filtering comprises parsing the document,
and wherein the filtered document comprises a token stream, the token stream comprising a
plurality of tokens.

3. A method as in claim 2, wherein the step of filtering further comprises retaining a
token in the token stream as a retained token according to at least one token threshold.

4. A method as in claim 3, wherein the step of filtering further comprises arranging the 
retained tokens in the token stream to obtain an arranged token stream.

5. A method as in claim 3, wherein the step of determining the hash value for the filtered
document comprises determining the hash value by processing individually each retained token
in the token stream.

6. A method as in claim 2, wherein the step of filtering further comprises:

determining a score for each token in the token stream;
comparing the score for each token to a first token threshold; and
modifying the token stream by removing each token having a score not satisfying the first
token threshold and retaining each token as a retained token having a score satisfying the first
token threshold.

7. A method as in claim 6, wherein the step of filtering further comprises:
comparing the score for each retained token to a second token threshold; and
modifying the token stream by removing each retained token having a score not
satisfying the second token threshold and retaining each retained token having a score satisfying
the second token threshold.
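
The two-threshold filtering of claims 6 and 7 can be illustrated with a short sketch. The claims do not define how a score is computed or what "satisfying" a threshold means, so this example assumes that scores are supplied by the caller, that the first threshold is a lower bound, and that the second threshold is an upper bound. The function name `filter_by_thresholds` is hypothetical.

```python
def filter_by_thresholds(tokens, scores, first_threshold, second_threshold):
    # Assumption: a token satisfies the first threshold when its score is
    # at least first_threshold (claim 6).
    retained = [t for t in tokens if scores.get(t, 0.0) >= first_threshold]
    # Assumption: a retained token satisfies the second threshold when its
    # score is at most second_threshold (claim 7).
    return [t for t in retained if scores.get(t, 0.0) <= second_threshold]
```

With scores of 0.1, 2.0, 9.5, and 3.0 for four tokens and thresholds of 1.0 and 5.0, only the tokens scoring 2.0 and 3.0 are retained under these assumptions.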

8. A method as in claim 2, wherein the step of filtering further comprises removing from
the token stream at least one token corresponding to a stop word.

9. A method as in claim 2, wherein the step of filtering further comprises removing a
token from the token stream if the token is a duplicate of another token in the token stream.

10. A method as in claim 2, wherein the step of filtering further comprises removing a
token from the token stream if the token is either a very frequent token or a very infrequent
token.

11. A method as in claim 2, wherein the step of filtering comprises removing at least one
token from the token stream.

12. A method as in claim 1, wherein the step of filtering comprises removing formatting
from the document.

13. A method as in claim 1, wherein the step of filtering uses collection statistics for
filtering the document.

14. A method as in claim 13, wherein the collection statistics pertain to the plurality of
documents.

15. A method as in claim 1, wherein the step of determining the hash value for the 
filtered document comprises using a hash algorithm to determine the hash value, the hash 
algorithm having an approximately even distribution of hash values.



16. A method as in claim 1, wherein the step of determining the hash value for the 
filtered document comprises using a standard hash algorithm to determine the hash value. 



17. A method as in claim 1, wherein the step of determining the hash value for the
filtered document comprises using a secure hash algorithm to determine the hash value. 



18. A method as in claim 1, wherein the step of determining the hash value for the
filtered document comprises using hash algorithm SHA-1 to determine the hash value. 
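
As an illustration of claims 5 and 18 taken together, the sketch below feeds each retained token individually into SHA-1 using Python's standard `hashlib` module. The function name `hash_token_stream` and the NUL byte written after each token are assumptions made for this example; the separator only keeps token boundaries unambiguous (so that "ab" does not hash the same as "a" followed by "b").

```python
import hashlib


def hash_token_stream(tokens):
    # Process each token in turn, updating a single SHA-1 state.
    h = hashlib.sha1()
    for token in tokens:
        h.update(token.encode("utf-8"))
        h.update(b"\x00")  # assumed separator between tokens
    # SHA-1 produces a 160-bit digest, i.e. 40 hexadecimal characters.
    return h.hexdigest()
```

Because the tokens are hashed in sequence, the result depends on token order, which is why an arranging step such as that of claim 4 matters if reordered documents are meant to produce the same hash value.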






19. A method as in claim 1, wherein the document storage structure comprises a hash
table.



20. A method as in claim 1, wherein the document storage structure comprises a tree.

21. A method as in claim 20, wherein the tree comprises a binary tree.



22. A method as in claim 21, wherein the binary tree comprises a binary balanced tree. 



23. A method as in claim 1, wherein the document storage structure comprises a hash
table and at least one tree.

24. A method as in claim 1, wherein the step of comparing comprises inserting the tuple
into the document storage structure.

25. A method as in claim 1, wherein the document storage structure comprises a hash
table, the hash table comprising a plurality of bins, each bin of the hash table comprising at least
one tuple of the plurality of tuples, and

wherein the step of determining if the tuple is clustered with another tuple comprises
determining if the tuple is co-located with another tuple at a bin of the hash table.

26. A method as in claim 1, wherein the document storage structure comprises a tree, the
tree comprising a plurality of branches, each bucket of the tree comprising at least one tuple of
the plurality of tuples, and

wherein the step of determining if the tuple is clustered with another tuple comprises
determining if the tuple is co-located with another tuple in a bucket of the tree.
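
The tree-based co-location test of claim 26 can be sketched with a simple binary search tree keyed on the hash value, where each node holds a bucket of document identifiers. The class names `_Node` and `DocumentTree` are hypothetical, and this example tree is not balanced (claim 22 recites a balanced variant); it is only meant to show how insertion and the co-location check can happen in one traversal.

```python
class _Node:
    def __init__(self, hash_value, doc_id):
        self.hash_value = hash_value
        self.bucket = [doc_id]  # documents sharing this hash value
        self.left = None
        self.right = None


class DocumentTree:
    """Unbalanced binary search tree keyed on hash value (illustrative)."""

    def __init__(self):
        self._root = None

    def insert(self, doc_tuple):
        # Insert (doc_id, hash_value) and return the ids already in the
        # same bucket, i.e. the documents this one is co-located with.
        doc_id, hash_value = doc_tuple
        if self._root is None:
            self._root = _Node(hash_value, doc_id)
            return []
        node = self._root
        while True:
            if hash_value == node.hash_value:
                co_located = list(node.bucket)
                node.bucket.append(doc_id)
                return co_located
            branch = "left" if hash_value < node.hash_value else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, _Node(hash_value, doc_id))
                return []
            node = child
```

A non-empty return value from `insert` indicates that the new tuple landed in an occupied bucket, which in this sketch is the signal that a similar document was already stored.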

27. A computer for performing the method of claim 1.

28. A computer-readable medium having software for performing the method of claim 1.

29. An apparatus for detecting similar documents comprising: 
means for obtaining a document; 

means for filtering the document to obtain a filtered document; 

means for determining a document identifier for the filtered document and a hash value
for the filtered document;

means for generating a tuple for the filtered document, the tuple comprising the document
identifier for the filtered document and the hash value for the filtered document;

means for comparing the tuple for the filtered document with a document storage
structure comprising a plurality of tuples, each tuple in the plurality of tuples representing one of
a plurality of documents, each tuple in the plurality of tuples comprising a document identifier
and a hash value; and

means for determining if the tuple for the filtered document is clustered with another
tuple in the document storage structure, thereby detecting if the document is similar to another
document represented by the another tuple in the document storage structure.

30. A method for detecting similar documents comprising the steps of:
obtaining a document;

parsing the document to remove formatting and to obtain a token stream, the token
stream comprising a plurality of tokens;

retaining only retained tokens in the token stream by using at least one token threshold;

arranging the retained tokens to obtain an arranged token stream;

processing in turn each retained token in the arranged token stream using a hash
algorithm to obtain a hash value for the document;

generating a document identifier for the document;

forming a tuple for the document, the tuple comprising the document identifier for the
document and the hash value for the document;

inserting the tuple for the document into a document storage tree, the document storage
tree comprising a plurality of tuples, each tuple located at a bucket of the document storage tree,
each tuple in the plurality of tuples representing one of a plurality of documents, each tuple in the
plurality of tuples comprising a document identifier and a hash value; and

determining if the tuple for the document is co-located with another tuple at a same
bucket in the document storage tree, thereby detecting if the document is similar to another
document represented by the another tuple in the document storage tree.
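
The token-processing steps of claim 30, up to forming the tuple, can be sketched as below; the storage-tree steps are left out of this example. Several details are assumptions rather than part of the claim: formatting is taken to mean HTML-like tags, token scores are supplied by the caller with the threshold acting as a lower bound, duplicates are dropped, and the arrangement is a plain lexicographic sort. The name `document_tuple` is hypothetical.

```python
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")    # assumed notion of "formatting"
WORD_RE = re.compile(r"[a-z0-9]+")  # assumed tokenizer


def document_tuple(doc_id, raw_text, scores, threshold):
    # Parse: strip tags, lowercase, and split into tokens.
    text = TAG_RE.sub(" ", raw_text).lower()
    tokens = WORD_RE.findall(text)
    # Retain: keep tokens whose score meets the threshold (duplicates dropped).
    retained = {t for t in tokens if scores.get(t, 0.0) >= threshold}
    # Arrange: sort so that token order in the source does not matter.
    arranged = sorted(retained)
    # Hash: process each retained token in turn with SHA-1.
    h = hashlib.sha1()
    for token in arranged:
        h.update(token.encode("utf-8") + b"\x00")
    # Form the tuple: (document identifier, hash value).
    return (doc_id, h.hexdigest())
```

Under these assumptions, two documents that differ only in formatting, word order, or low-scoring tokens yield tuples with the same hash value, while a document missing a retained token yields a different one.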

31. A computer for performing the method of claim 30.

32. A computer-readable medium having software for performing the method of claim
30.

33. An apparatus for detecting similar documents comprising:

means for parsing the document to remove formatting and to obtain a token stream, the
token stream comprising a plurality of tokens;

means for retaining only retained tokens in the token stream by using at least one token
threshold;

means for arranging the retained tokens to obtain an arranged token stream;

means for processing in turn each retained token in the arranged token stream using a
hash algorithm to obtain a hash value for the document;

means for generating a document identifier for the document;

means for forming a tuple for the document, the tuple comprising the document identifier
for the document and the hash value for the document;

means for inserting the tuple for the document into a document storage tree, the
document storage tree comprising a plurality of tuples, each tuple located at a bucket of the
document storage tree, each tuple in the plurality of tuples representing one of a plurality of
documents, each tuple in the plurality of tuples comprising a document identifier and a hash
value; and

means for determining if the tuple for the document is co-located with another tuple at a
same bucket in the document storage tree, thereby detecting if the document is similar to another
document represented by the another tuple in the document storage tree.

34. A method for detecting similar documents comprising the steps of: 
determining a hash value for a document;

accessing a document storage structure comprising a plurality of hash values, each hash
value in the plurality of hash values representing one of a plurality of documents; and

determining if the hash value for the document is equivalent to another hash value in the
document storage structure, thereby detecting if the document is similar to another document
represented by the another hash value in the document storage structure.

35. A computer for performing the method of claim 34. 

36. A computer-readable medium having software for performing the method of claim
34.

37. An apparatus for detecting similar documents comprising:
means for determining a hash value for a document;

means for accessing a document storage structure comprising a plurality of hash values,
each hash value in the plurality of hash values representing one of a plurality of documents; and

means for determining if the hash value for the document is equivalent to another hash
value in the document storage structure, thereby detecting if the document is similar to another
document represented by the another hash value in the document storage structure.

38. A method for detecting similar documents comprising the step of:

comparing a document to a plurality of documents in a document collection using a hash
algorithm and collection statistics to detect if the document is similar to any of the documents in
the document collection.

39. A method as in claim 38, wherein the collection statistics pertain to the document
collection.

40. A computer for performing the method of claim 38.

41. A computer-readable medium having software for performing the method of claim
38.

42. An apparatus for detecting similar documents comprising:

means for comparing a document to a plurality of documents in a document collection
using a hash algorithm and collection statistics to detect if the document is similar to any of the
documents in the document collection.


